Building a LINQ query macro in Rust because I can

C# has this module called LINQ, which has all the common functional operations, except named differently, because Microsoft. For example, it has .Select(f), which is map. On top of that, it has special syntax that allows you to write, essentially, queries which desugar into chains of these functional calls.

var items = from item in someList
			where item % 2 == 0
			orderby item descending
			select item * item;

We could write something like that, and this results in an iterator with all the even values from someList sorted in descending order, then squared. Notably, this is an iterator—no actual work happens yet, other than building a representation of what we want to produce.

And, I figured, I can build something similar in Rust, using only declarative macros. I've written several complex Rust macros before, so starting out I thought it'd be a relatively straightforward exercise. It wasn't.

This isn't gonna be a thorough macro tutorial, but we're gonna go over one very powerful pattern, as well as touch on some limitations and practical concerns when creating your own syntax like this.

So—let's do it!

Declarative macros

In Rust, there's a macro called macro_rules which allows you to create very specific, limited macros. They're called "declarative" because all we're gonna do is declare a shape of input, and if that shape is matched, produce one specific output based on it. It's just pattern-matching—something you're probably well-acquainted with if you've written any Rust.

Whenever I start writing a declarative macro, I like to start with a "minimal viable use" of it; how I would call it if it already existed:

let items = query! {
	from x in some_list
	where x % 2 == 0
	select x * x
};

I skipped ordering for now, because there's no built-in for that in Rust; we'll get to it later. So, for the very first version of our macro, let's just hardcode this entire structure:

macro_rules! query {
	(from $var:ident in $source:expr
	 where $filter:expr
	 select $map:expr) => {
		 $source.into_iter().filter(|$var| $filter).map(|$var| $map)
	 };
}

That creates the macro, with one pattern matching case. It matches the pattern from $var:ident in $source:expr where $filter:expr select $map:expr, which very closely follows our example invocation. Each $name:type pair matches a specific kind of legal Rust syntax, which can then be spliced back in in the output using $name.

For example, $source:expr matches a single expression as understood by Rust. It can be as simple as a literal value, or as complicated as a deeply nested block. $var:ident matches an identifier—in this case, just the variable name x.^[1]

So, all good, yeah? Except...

Oops!

This doesn't work.

Conditions and restrictions apply

`$source:expr` is followed by `where`, which is not allowed for `expr` fragments allowed there are: `=>`, `,` or `;`

`$filter:expr` is followed by `select`, which is not allowed for `expr` fragments allowed there are: `=>`, `,` or `;`

As I mentioned before, declarative macros are pretty limited. One of those limitations is that the compiler needs to be able to unambiguously parse the contents. Reasonable restriction, but a restriction nonetheless.

In this case, it doesn't like that where is right after an expression, because any given token might be part of that expression, and we're deliberately not going into a deep web of decisions to make it one or the other. Instead, there's a simple rule: There's only a limited amount of tokens allowed right after an $_:expr; an arrow, a comma, or a semicolon.^[2]

In fact, there's already a crate called linq that simply chooses to use the comma. Very pragmatic! I, however, am not pragmatic. But we do have to decide something. My decision is gonna be a bit unwieldy, but practical in other ways.

Delimiters

let items = query! {
	from x in { some_list }
	where { x % 2 == 0 }
	select x * x
};

macro_rules! query {
	(from $var:ident in { $source:expr }
	 where { $filter:expr }
	 select $map:expr) => {
		 $source.into_iter().filter(|$var| $filter).map(|$var| $map)
	 };
}

One very important thing to keep in mind about macro_rules is that the contents of an invocation still have to be legal in some sense. One particular way this manifests is that parentheses, brackets, and braces need to be matched. We can use this to disambiguate things the compiler would struggle with (or just complain about).

In this case, surrounding things with braces makes it just look like blocks are required. Unfortunate, but not exactly obscure, all things considered. We don't need it for select, because it's at the end, so no need to disambiguate with tokens after it. Though, because blocks are still expressions, we could add them and it'd still work!

When we expand the macro invocation, the result is pretty much what we'd expect:

some_list
    .into_iter()
    .filter(|x| (x % 2 == 0))
    .map(|x| (x * x))

Macro-generated code usually has a lot of extra pairs of parentheses; don't worry about it.

Folding things together

This is neat, sure, but we're only doing one very specific kind of query; we can't even leave things out! Let alone deal with monstrosities like this:

let items = final_query! {
	from a in { 0..10 }
	from b in { 0..10 }
	let unequal = { a != b }
	where { unequal }
	orderby { *a } ascending
	group { b } by { *a } into { g }
	select g
};

Things can be in different orders, compositions and overall it's just a lot of complexity.

But, if we squint hard enough, it's just a fold! We have a list of clauses which are consumed one at a time to build one big expression. All we need is a way to build a recursive fold within Rust's declarative macro mechanism. Peanuts! Totally.

Programmer jokes are the worst

Turns out, you can do that. Somebody even gave the pattern a name! Unfortunately, that name is "incremental TT munchers".

But, the idea is relatively simple: We only match the first couple terms that interest us, and keep dragging along the rest as a match of type tt (hence the name). $_:tt matches pretty much anything and passes it along unharmed. Let's do a dumb example to get the idea.

add!(1, 2, 3, 4);

And we want this to expand to:

1 + 2 + 3 + 4

It's a very simple example of the same mechanism we want to use for our more complex macro. Reading the page on grumble incremental TT munchers, the pattern seems very simple to implement. Let's just follow the example 1:1 with minor adjustments:

macro_rules! add {
	() => {};
	($number:literal $($rest:tt)*) => { $number add!($($rest)*) };
	(, $number:literal $($rest:tt)*) => { + $number add!($($rest)*) };
}

Neat, now lets just fully expand that, and...

1 add!(,2,3,4)

Oh.

Oops!

This doesn't work.

Conditions and restrictions (still) apply

Here's another one: Everything produced by a macro must be a full, self-contained bit of Rust code. Maybe alarm bells went off when you were reading the definition of add, but I don't blame you if they didn't; it's easy to overlook when you're not aided by a compiler.

macro expansion ignores token `add` and any following
the usage of `add!` is likely invalid in expression context

Basically, just writing two things next to each other isn't valid Rust. 1 2 is not valid Rust, and that's pretty much what we were doing when writing $number add!($($rest)*). And the last arm returns { + $number adder_chain($($rest)*) }, which just seems like utter gibberish in retrospect. + 2 isn't a self-contained piece of code^[3].

The missing piece

If you're comfortable with ideas from functional programming, it's probably clear what we're missing for our fold to be complete. What we wrote is actually a map, and that simply doesn't work for producing one expression out of a series of inputs. Folds have an accumulator.

macro_rules! add {
	([$acc:expr]) => { $acc };
	([$acc:expr] , $number:literal $($rest:tt)*) => {
		add!(
			[$acc + $number]
			$($rest)*
		)
	};
	($number:literal $($rest:tt)*) => {
		add!(
			[$number]
			$($rest)*
		)
	};
}

We now have three match arms:

only accumulator—the base case in our recursion, which simply returns the accumulator;
accumulator with another , $number, which adds this new number to the accumulator and keeps going;
the initial call syntax, which translates itself to the accumulator-based syntax.

So, if we expand:

add!(1, 2, 3, 4)

We get:

(((1 + 2) + 3) + 4)

So yeah, lotta extra parentheses, but functionally equivalent to the code we wanted. Just a chain of additions.

Now's the part where it gets slightly complex, unlike before

There's a few more bits we're missing conceptually, but I hope you get the general idea: Take each clause one by one, building up one big expression chain in an accumulator, until we're done and can return it. Here's my original "minimum viable use" again:

query! {
	from x in { some_list }
	where { x % 2 == 0 }
	select x * x
}

We just need to implement from, where and select for now.

macro_rules! query {
    ([$var:ident] [$acc:expr] select $select:expr) => {
        $acc.map(|$var| { $select })
    };
    
    ([$var:ident] [$acc:expr] where { $filter:expr } $($rest:tt)*) => {
        query!([$var] [$acc.filter(|$var| { $filter })] $($rest)*)
    };

    (from $var:ident in { $source:expr } $($rest:tt)*) => {
        query!([$var] [$source.into_iter()] $($rest)*)
    };
}

Alright, that's not that bad. Here's the produced output when expanding the MVU:

((some_list.into_iter()).filter(|x| (x % 2 == 0))).map(|x| (x * x))

And, as proof, we can even introduce an additional filter clause:

query! {
	from x in { some_list }
	where { x % 2 == 0 }
	where { x % 3 == 0 }
	select x * x
}

And the expansion has it:

(((some_list.into_iter()).filter(|x| (x % 2 == 0))).filter(|x| (x % 3 == 0))).map(|x| (x * x))

Now that we're convinced it works, let's address the $var in the room: There's not just an accumulator, but an extra $var:ident we're dragging along through all of our recursive calls. Not unsurprisingly, it's so that x is accessible in the bodies of the where/select clauses we write.

Taking stock

So, we have from(1)^[4], where and select. What's missing?

let
from(2+)/group/orderby

let is very interesting: It lets you (hah) introduce a new binding, visible to subsequent clauses.

The other three I grouped together, because their implementations are very similar, and mostly boil down to implementing the underlying functionality outside of the macro. There's a really key takeaway that I internalized from implementing them, so we'll take a look at them later, but let's focus on let for now.

I was lying, now it gets complex.

query! {
	from x in { some_list }
	let is_even = { x % 2 == 0 }
	where { is_even }
	select x * x
}

It's pretty clear what this is supposed to do, but how to implement it? If you remember, the name x is only available further down because we're keeping it in $var, to splice in wherever it needs to become available again. We'd need to do something similar for the names in let bindings.

We'll have to power up $var, turning it into full-fledged accumulator in it's own right. The matcher for that is pat.

macro_rules! query {
    ([$var:pat] [$acc:expr] select $select:expr) => {
        $acc.map(|$var| { $select })
    };
    
    ([$var:pat] [$acc:expr] where { $filter:expr } $($rest:tt)*) => {
        query!([$var] [$acc.filter(|$var| { $filter })] $($rest)*)
    };

    (from $var:pat in { $source:expr } $($rest:tt)*) => {
        query!([$var] [$source.into_iter()] $($rest)*)
    };
}

$_:pat matches patterns; so stuff like (a, b). Or just a. For what we're trying to do, it's much more powerful than $_:ident! As a bonus, this works for free now:

query! {
	from (a, b) in { list_of_pairs }
	select a + b
}

It expands into following extremely graceful snippet:

(list_of_pairs.into_iter()).map(|(a, b)| (a + b))

You see how the pattern stored in $var serves as a destructuring inside the closure that the body of select gets transformed into? This is it. That's the trick that'll let us let.

macro_rules! query {
    ...

	([$var:pat] [$acc:expr] let $name:ident = { $value:expr } $($rest:tt)*) => {
		query!([($var, $name)] [$acc.map(|$var| ($var, $value))] $($rest)*)
	};
    
    ...
}

One new match. The left-hand side should be pretty routine by now. The right-hand side is where the magic is happening. As promised, we're making $var a fully-fledged accumulator! patterns are recursive, you can build bigger patterns out of smaller ones! If we expand this we get...

query! {
	from (a, b) in { list_of_pairs }
	let sum = { a + b }
	select sum
}

expected expression, found `(a, b)`

...o-oh.

Conditions and restrictions never stopped applying

If the compiler expanded the macro, we'd get this:

list_of_pairs.into_iter().map(|(a, b)| ((a, b), a + b)).map(|((a, b), sum)| sum)

Which, let's just admire that for a moment. Isn't it cool how that nested $var lets us destructure everything in the final map, so that sum is visible again? I love it so much. ...except, it isn't real. It didn't compile. What does that cryptic message mean, anyway? Well, the problem is actually both quite subtle, and trivial to sidestep.

You know how 1 + 1.0 will fail to compile, because even though both values are technically 1, they're of different types? That's basically what's happening here, but on a syntax fragment level.

When we substitute in $var, which is a pat, it must be used as a pat. Even though (a, b): expr and (a, b): pat look the same to us, just (a, b), to the compiler at that point they're of different types.

list_of_pairs.into_iter().map(|(a, b)| ((a, b), a + b)).map(|((a, b), sum)| sum)
                                        ~~~~~~
	                                    this is a `pat`, but we want an `expr`

It's almost silly what we do to fix this, to be honest:

macro_rules! query {
    ...

	([$var:pat] [$acc:expr] let $name:ident = { $value:expr } $($rest:tt)*) => {
		query!([($var, $name)] [$acc.map(|old @ $var| (old, $value))] $($rest)*)
	};
    
    ...
}

((list_of_pairs.into_iter()).map(|old @ (a, b)| (old, (a + b)))).map(|((a, b), sum)| (a + b))

Here's the relevant Rust Book chapter:

The Rust Programming Language: Pattern Syntax

The at operator @ lets us create a variable that holds a value at the same time as we’re testing that value for a pattern match.

So old contains (a, b), but as an expression—exactly what we wanted. So now we have working let bindings! Here's the full macro we have so far:

macro_rules! query {
    ([$var:pat] [$acc:expr] where { $filter:expr } $($rest:tt)*) => {
        query!([$var] [$acc.filter(|$var| { $filter })] $($rest)*)
    };

    ([$var:pat] [$acc:expr] let $name:ident = { $value:expr } $($rest:tt)*) => {
        query!([($var, $name)] [$acc.map(|old @ $var| (old, $value))] $($rest)*)
    };

    ([$var:pat] [$acc:expr] select $select:expr) => {
        $acc.map(|$var| { $select })
    };

    (from $var:pat in { $source:expr } $($rest:tt)*) => {
        query!([$var] [$source.into_iter()] $($rest)*)
    }
}

I don't know about you, but I think that's remarkably little code for what we accomplished so far!

A brief sidequest: Hygiene

The more mischievous among you might now ask, "what if we cause a name clash with old"?

let items = query! {
    from (old, new) in { [(1, 2), (2, 4), (3, 5), (4, 7)] }
    let square = { old * old }
    select square
};

for item in items {
	println!("{item}");
}

A simple macro that takes a list of pairs, and returns the squares of the first item of each pair. Let's look at the expansion:

(([(1, 2), (2, 4), (3, 5), (4, 7)].into_iter())
	.map(|old @ (old, new)| (old, (old * old))))
    .map(|((old, new), square)| square)

That... looks bad! old is defined twice, to mean different things—once the first item of the pair being mapped over, and once the entire pair! And then the output, (old, (old * old))—that seems very wrong, doesn't it?

Just to make totally sure, let's actually run it.

> cargo run
1
4
9
16

O-oh. That's the correct result! How is this possible? If I try to just run the expanded code directly, the compiler shouts at me! Multiple times!

identifier `old` is bound more than once in this parameter list
used as parameter more than once

mismatched types
expected type `{integer}`
  found tuple `(_, _)`

Well, that's because macros shower regularly.

There's this concept called "macro hygiene", and Rust is very strict about it. Turns out, in this expanded version:

(([(1, 2), (2, 4), (3, 5), (4, 7)].into_iter())
	.map(|old @ (old, new)| (old, (old * old))))
    .map(|((old, new), square)| square)

old: $var and old: macro-intern are two different variables, they just look the same to us in the expanded output. Another kind of type mismatch, but this time working to our advantage!

.map(|old @ (old, new)| (old, (old * old))))
	  ^^^    ^^^        ^^^^   ^^^   ^^^
	  |      |          |      |-----| both from $var
	  |      |          |
	  |      |          | macro-internal
	  |      |
	  |      | from $var
	  |      
	  | macro-internal

There's nothing the user of your macro can do to gain access to a name you defined inside your macro. This is a very important rule to remember! To construct a simple example:

macro_rules! secret_x {
	($($stuff:tt)*) => {
		let x: i32 = 5;
		$($stuff)*
	};
}

fn main() {
	secret_x!({
		println!("{x}");
	});
}

This doesn't compile! "Cannot find value of x". Even though the hand-expanded version would trivially compile. As I said, there's no way to get something you hard-named inside your macro, outside the macro.

The reason our macro works is because we splice names given to us back into the final expression. For contrast, this works:

macro_rules! secret_var {
    ($var:ident, $($stuff:tt)*) => {
        let $var: i32 = 5;
        $($stuff)*
    }
}

fn main() {
    secret_var!(x, {
        println!("{x}");
    });
}

Now, x is visible, because we didn't use a macro-internal name, but one given to us in the invocation.

Nifty, huh?

Sometimes, there's more to macros than macros...

So that leaves us with from(2+), group and orderby. I'm gonna do orderby, and then just put in a link to the final code, ok? They're relatively similar, implementation-wise. Also, no bothering me about these—they're not meant to be good, just functional enough to power the macro.

So, there's more or less three valid forms of the orderby clause:

orderby { expression } ascending
orderby { expression } descending
orderby { expression }

Our $acc is an Iterator at every step, so we need to build an Iterator we can return that implements the behaviour we want.

enum Sorted<I, E, F> {
	Ready(Option<I>, F),
	Active(Vec<E>),
}

This'll be our iterator type. We wanna split it into two cases so that we can delay the whole sorting logic into the iterator .next() call, to stay true to the idea that constructing a query doesn't consume it. Three type parameters is kinda rough, but it's pretty clear what we need them for—I is the iterator we're wrapping around, F is the function producing our sort keys (it's orderby after all), and E is the type of our elements.

Anyway, time for some light and breezy, regular non-macro Rust code:

impl<I, E, F, K> Iterator for Sorted<I, E, F>
where
	I: Iterator<Item = E>,
	F: FnMut(&E) -> K,
	K: Ord,
{
	type Item = E;

	fn next(&mut self) -> Option<Self::Item> {
		match self {
			Self::Ready(items, key) => {
				// .take()ing an `Option` lets us avoid multiple borrow
				// problems when assigning to *self later.
				let mut items: Vec<_> = items.take()?.collect();
				items.sort_unstable_by_key(key);
				*self = Self::Active(items);
				self.next()
			}
			Self::Active(items) => items.pop(),
		}
	}
}

Sorry about that. It's probably a lot to take in at once, so let's take it a step at a time.

impl<I, E, F, K>

We're about to write something where there's 4 relevant generic types. Three of those we already know the purpose of, but K is new—it's simply the type of the keys we sort by, though.

Iterator for Sorted<I, E, F>

We're adding the Iterator trait to our Sorted type; and we're correlating three of the generic types with the types used in Sorted.

where
	I: Iterator<Item = E>,
	F: FnMut(&E) -> K,
	K: Ord

This is us telling Rust what I already explained to you earlier about the nature of these types. I is an iterator we're wrapping around, and it produces items of type E. F is a function that produces Keys to sort by; which means that K needs to be able to be compared, which Ord does.

The rest of the implementation is relatively rote, which gives us a minimally-implemented sorted iterator. Back to macro stuff.

...other times, there's not.

let list_of_pairs = [(1, 4), (2, 1), (3, 3), (4, 2)];
let items = query! {
	from (a, b) in { list_of_pairs }
	orderby { *b } descending
	select a
};
for item in items {
	println!("{item}");
}

We want this to print, in order, 1, 3, 4, and 2. So let's do the macro part:

macro_rules! query {
	...

	([$var:pat] [$builder:expr] orderby { $($key:tt)* } descending $($rest:tt)*) => {
		query!([$var] [Sorted::Ready(Some($builder), |$var| { $($key)* })] $($rest)*)
	};
	
	([$var:pat] [$builder:expr] orderby { $($key:tt)+ } ascending $($rest:tt)*) => {
		query!([$var] [Sorted::Ready(Some($builder), |$var| std::cmp::Reverse({ $($key)* }))] $($rest)*)
	};

	([$var:pat] [$builder:expr] orderby { $($key:tt)* } $($rest:tt)*) => {
		query!([$var] [Sorted::Ready(Some($builder), |$var| std::cmp::Reverse({ $($key)* }))] $($rest)*)
	};

	...
}

You may look at all this code duplication, and think there must be a better way. There might be, but probably isn't. I've always had bad experiences with optional matches near the end; but hey, if you mess around with it and find something, let me know!

Anyway, the body is verbose, but relatively trivial. We just construct our new iterator using all the same tricks as before. If you didn't know about std::cmp::Reverse, it's simply a wrapper around any comparable value, and it compares the opposite way. So 4 > 2, but Reverse(4) < Reverse(2). It's useful for flipping sorts.

Since the implementation is so easy, I'm sure there's nothing that can go wrong when we use it!

the method `map` exists for enum `Sorted<IntoIter<({integer}, {integer}), 4>, _, [closure@main.rs:13:61]>`, but its trait bounds were not satisfied
the following trait bounds were not satisfied:
`Sorted<std::array::IntoIter<({integer}, {integer}), 4>, _, [closure@src\main.rs:13:61: 13:67]>: std::iter::Iterator`
which is required by `&mut Sorted<std::array::IntoIter<({integer}, {integer}), 4>, _, [closure@src\main.rs:13:61: 13:67]>: std::iter::Iterator`

Oh, sweet, we've made a different part of the compiler mad! This time it's the type inference. Basically what it is saying is that it doesn't believe us that the type parameters we're giving Sorted<I, E, F> actually fulfill the requirements we laid out in our Iterator implementation:

where
	I: Iterator<Item = E>,
	F: FnMut(&E) -> K,
	K: Ord

It says that as "the trait bounds were not satisfied" for Sorted(...): std::iter::Iterator. Basically, Sorted does implement Iterator, but not unless the types meet these criteria, and the compiler knows that they meet these criteria.

How do we convince it? By introducing a point where it can check itself.

impl<I, E, F, K> Sorted<I, E, F>
where
	I: Iterator<Item = E>,
	F: FnMut(&E) -> K,
	K: Ord
{
	fn new(iterator: I, key: F) -> Self {
		Self::Ready(Some(iterator), key)
	}
}

Here we have a new that only accepts arguments that produce a Sorted that actually fulfills all the trait bounds for its Iterator implementation—notice it's the exact same where block as before. Now we simply need to adjust our macro to use new instead of directly constructing (a non-type constrained) instance:

macro_rules! query {
	...

	([$var:pat] [$builder:expr] orderby { $($key:tt)* } descending $($rest:tt)*) => {
		query!([$var] [Sorted::new($builder, |$var| { $($key)* })] $($rest)*)
	};

	([$var:pat] [$builder:expr] orderby { $($key:tt)+ } ascending $($rest:tt)*) => {
		query!([$var] [Sorted::new($builder, |$var| std::cmp::Reverse({ $($key)* }))] $($rest)*)
	};

	([$var:pat] [$builder:expr] orderby { $($key:tt)* } $($rest:tt)*) => {
		query!([$var] [Sorted::new($builder, |$var| std::cmp::Reverse({ $($key)* }))] $($rest)*)
	};

	...
}

Aaaaaand:

> cargo run
1
3
4
2

Sick.

Unused variable: 'b'

By this point, if you've been running the examples, you might have noticed that the generated code produces a lot of "unused" warnings. That's because they're bound and re-bound every closure in the entire chain, and they'll go unused in at least some of those. It happens.

There's #[allow(unused)].

And that's everything!

Oops!

While writing this section, I actually remembered that orderby clauses in LINQ allow you to list multiple things to sort by—if one key is equal, sort by the next one. Implementing this would actually be a great exercise to be left the reader, though, not least because I'm 27k characters into writing this.

Also, because fun related fact, tuples in Rust implement Ord if all the types they contain do, with these exact semantics. That means you can simply use a tuple of values as the key and trivially reimplement the same thing!

Some sort of conclusion

Here's the source code I produced whilst writing this, as promised. Some parts deviate very slightly from what I've presented in this article, but overall, everything should be recognizable.

Looking back, we've covered quite a lot:

declarative macros;
going from a simple macro to a recursive one;
how while powerful, the macro system puts some limitations on your choice of syntax;
identifier hygiene in macros;
some small nifty tips and tricks, both for macros and regular code;
some things about traits, too!

Definitely need to keep scope creep in check next time. But I think it makes sense, since this was a very exploratory kinda thing for me.

I had a ton of fun writing both the code and the article! Please do let me know which parts are incohorent or boring; if I'm gonna write more semi-technical things, might as well become better at it.

A list of all the different kinds of things you can match is documented in the Rust Book. ↩︎
The Rust Book of course has a chapter detailing what can follow what, and why. ↩︎
Well, technically +2 is a self-contained piece of code, but not in the sense we're talking about! Unary plus is a completely different piece of code than "half of a binary plus", which is what we'd've wanted. ↩︎
from(1) refers to the from at the beginning of the query. from(2+) are subsequent clauses of the same form. In LINQ, those result in going through the Cartesian product of both sources. That's why from(1) and from(2+) have different implementations. ↩︎